Higher criticism for discriminating word-frequency tables and authorship attribution


Abstract

We adapt the Higher Criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting and tuning, reporting accuracy at the state-of-the-art level in various current challenges. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that, in comparing the similarity of a new document and a corpus of a single author, HC is mostly affected by words characteristic of the author and is relatively unaffected by topic structure.
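The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: each word gets an exact two-sided binomial p-value comparing its counts in the two tables, the null proportion is taken to be the share of tokens in the first table, HC is computed in one common form over the smallest fraction `gamma` of the sorted p-values, and a document is attributed to the corpus with the smallest HC value. The function names (`binom_two_sided_pvalue`, `higher_criticism`, `hc_discrepancy`, `attribute`) and the default `gamma=0.4` are hypothetical choices made for this illustration.

```python
import math
from collections import Counter


def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k under Bin(n, p)."""
    if n == 0:
        return 1.0

    def log_pmf(j):
        return (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                + j * math.log(p) + (n - j) * math.log1p(-p))

    observed = log_pmf(k)
    total = sum(math.exp(log_pmf(j)) for j in range(n + 1)
                if log_pmf(j) <= observed + 1e-9)
    return min(1.0, total)


def higher_criticism(pvalues, gamma=0.4):
    """Return (HC statistic, HC threshold p-value) for a list of p-values.

    Uses the form sqrt(n) * (i/n - p_(i)) / sqrt(i/n * (1 - i/n)),
    maximized over the smallest gamma-fraction of the sorted p-values."""
    n = len(pvalues)
    if n < 2:
        raise ValueError("need at least two p-values")
    sorted_p = sorted(pvalues)
    limit = max(1, int(gamma * n))
    best, threshold = -math.inf, sorted_p[0]
    for i, p in enumerate(sorted_p[:limit], start=1):
        u = i / n
        if u >= 1.0:  # the normalization is undefined at u = 1
            break
        z = math.sqrt(n) * (u - p) / math.sqrt(u * (1.0 - u))
        if z > best:
            best, threshold = z, p
    return best, threshold


def hc_discrepancy(tokens_a, tokens_b, gamma=0.4):
    """HC-based discrepancy between the word-frequency tables of two
    non-empty token lists, plus the words at or below the HC threshold."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    # Assumed null: a word's occurrences split in proportion to table sizes.
    p_null = total_a / (total_a + total_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    pvals = {
        w: binom_two_sided_pvalue(counts_a[w], counts_a[w] + counts_b[w], p_null)
        for w in vocab
    }
    hc, threshold = higher_criticism(list(pvals.values()), gamma)
    discriminating = [w for w in vocab if pvals[w] <= threshold]
    return hc, discriminating


def attribute(doc_tokens, corpora, gamma=0.4):
    """Pick the author whose corpus has the smallest HC discrepancy
    from the document. `corpora` maps author name -> list of tokens."""
    scores = {author: hc_discrepancy(doc_tokens, tokens, gamma)[0]
              for author, tokens in corpora.items()}
    return min(scores, key=scores.get)
```

In this sketch a small HC value means the two tables are hard to tell apart, and the words returned alongside it are those whose p-values fall at or below the HC threshold, which corresponds to the "subset of discriminating words" mentioned in the abstract.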


Similar resources

Authorship Attribution Using Word Sequences

Authorship attribution is the task of identifying the author of a given text. The main concern of this task is to define an appropriate characterization of documents that captures the writing style of authors. This paper proposes a new method for authorship attribution supported on the idea that a proper identification of authors must consider both stylistic and topic features of texts. This me...


Authorship Attribution Using Word Network Features

In this paper, we explore a set of novel features for authorship attribution of documents. These features are derived from a word network representation of natural language text. As has been noted in previous studies, natural language tends to show complex network structure at word level, with low degrees of separation and scale-free (power law) degree distribution. There has also been work on ...


More than Word Frequencies: Authorship Attribution via Natural Frequency Zoned Word Distribution Analysis

With the increasing popularity and availability of digital text data, authorship of digital texts cannot be taken for granted due to the ease of copying and parsing. This paper presents a new text style analysis called natural frequency zoned word distribution analysis (NFZ-WDA), and then a basic authorship attribution scheme and an open authorship attribution scheme for digital texts based ...


Authorship Attribution

Authorship attribution, the science of inferring characteristics of the author from the characteristics of documents written by that author, is a problem with a long history and a wide range of application. Recent work in “non-traditional” authorship attribution demonstrates the practicality of automatically analyzing documents based on authorial style, but the state of the art is confusing. An...



Journal

Journal title: The Annals of Applied Statistics

Year: 2022

ISSN: 1941-7330, 1932-6157

DOI: https://doi.org/10.1214/21-aoas1544